Strangely, Matrix Multiplications on GPUs Run Faster When Given “Predictable” Data! [short]

omnivore gpu-programming good-read!


Highlights

CUTLASS’s profiler, by default, actually initializes the values in a fairly strange way: it initializes the inputs only with integers. Curious whether this matters, I try:

import torch

N = 8192  # assumed size; `benchmark` is the author's (unshown) helper
zero_inputs = torch.zeros(N, N, device="cuda")
randn_inputs = torch.randn(N, N, device="cuda")
benchmark(zero_inputs)   # 295 Teraflops
benchmark(randn_inputs)  # 257 Teraflops

What? How could the values of the matrix affect the runtime of the model? ⤴️
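The `benchmark` helper isn't shown in the highlight. A minimal sketch of one plausible implementation (the function name, warmup count, and FLOP accounting here are assumptions, not the author's code) that times `inp @ inp` and reports teraflops:

```python
import time
import torch

def benchmark(inp, n_iters=10):
    """Rough matmul throughput in teraflops (sketch, not the author's code)."""
    N = inp.shape[0]
    # Warm up so one-time setup costs don't pollute the timing.
    for _ in range(3):
        inp @ inp
    if inp.is_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(n_iters):
        inp @ inp
    if inp.is_cuda:
        torch.cuda.synchronize()  # make sure all timed work has finished
    elapsed = time.perf_counter() - start
    flops = 2 * N**3 * n_iters  # an N x N matmul is ~2*N^3 FLOPs
    return flops / elapsed / 1e12
```

On a CPU tensor this still runs (just far below GPU numbers), which makes the key point easy to reproduce: the same function, called on tensors that differ only in their values, can report different throughput.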

Dynamic (or switching) power is the culprit. Specifically, a small amount of power is consumed whenever a transistor switches states. If the transistor never needs to switch states, it doesn’t consume any extra power. On the other hand, if it’s rapidly flipping, then it consumes a ton of dynamic/switching power. Multiply that by the billions of transistors in your GPU, and you get the overall increase in power consumption. ⤴️
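The data dependence of switching activity can be modeled directly: count how many bits flip between consecutive values on a (hypothetical) 16-bit bus. All-zero data produces zero toggles, while random data toggles about half the bits per step — a toy model of why the values flowing through the chip change its dynamic power draw:

```python
import random

def bit_toggles(prev, cur, width=16):
    """Count bits that flip between two consecutive values on a width-bit bus."""
    return bin((prev ^ cur) & ((1 << width) - 1)).count("1")

def total_toggles(stream, width=16):
    """Total switching activity for a sequence of bus values."""
    return sum(bit_toggles(a, b, width) for a, b in zip(stream, stream[1:]))

random.seed(0)
zeros = [0] * 1000
noisy = [random.getrandbits(16) for _ in range(1000)]

total_toggles(zeros)  # 0 toggles: no switching, minimal dynamic power
total_toggles(noisy)  # thousands of toggles: lots of dynamic power
```

Real transistor-level switching is far more complicated than a bus model, but the direction of the effect is the same: predictable, low-entropy data flips fewer bits.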

In other words, the reason matrix multiplications are faster when passed zeros is that zeros reduce the “flipping” of enough transistors for the chip to stay under its power limit! ⤴️

In other words, matmuls on the H100 are primarily not compute- or bandwidth-limited; they are power-limited. ⤴️
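A quick sanity check on that claim: both benchmarked calls run the identical kernel on identically shaped tensors, so the FLOP count and bytes moved are the same regardless of the values (N and the element size below are assumptions for illustration). The only value-dependent resource left is power:

```python
N = 8192            # assumed problem size
bytes_per_elem = 2  # assuming 16-bit (e.g. bf16) inputs

flops = 2 * N**3                           # same for zeros and randn inputs
bytes_moved = 3 * N * N * bytes_per_elem   # two inputs + one output, same for both
arithmetic_intensity = flops / bytes_moved # also identical for both

# Since compute work and memory traffic are value-independent, the
# 295 -> 257 TFLOPs gap cannot come from a compute or bandwidth limit;
# the remaining value-dependent constraint is the power limit.
```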